{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup: Install Dependencies, Imports & Download Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-06-06T17:00:06.460001Z",
     "start_time": "2024-06-06T17:00:04.214098Z"
    }
   },
   "outputs": [],
   "source": [
    "!pip install matplotlib tqdm pandas numpy datasets --quiet --upgrade"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-06-06T17:00:07.041784Z",
     "start_time": "2024-06-06T17:00:06.461658Z"
    },
    "id": "WBVTItUX4yyr"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from datasets import load_dataset\n",
    "from tqdm import tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 👨🏾‍💻 Code Walkthrough\n",
    "Here's an explanation of the code structure provided:\n",
    "\n",
    "1. **Loading Data**: OpenAI embeddings are loaded from a parquet files (we can load upto 1M embedding) and concatenated into one array.\n",
    "2. **Binary Conversion**: A new array with the same shape is initialized with zeros, and the positive values in the original vectors are set to 1.\n",
    "3. **Accuracy Function**: The accuracy function compares original vectors with binary vectors for a given index, limit, and oversampling rate. The comparison is done using dot products and logical XOR, sorting the results, and measuring the intersection.\n",
    "4. **Testing**: The accuracy is tested for different oversampling rates (1, 2, 4), revealing a correctness of ~0.96 for an oversampling of 4.\n",
    "\n",
    "\n",
    "## 💿 Loading Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-06-06T17:01:09.343230Z",
     "start_time": "2024-06-06T17:00:07.042526Z"
    },
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 250
    },
    "id": "REJpFqkG7EG2",
    "outputId": "7a43c0ae-fbcc-45fe-fd58-bfe691297b22"
   },
   "outputs": [],
   "source": [
    "# Download from Huggingface Hub\n",
    "ds = load_dataset(\n",
    "    \"Qdrant/dbpedia-entities-openai3-text-embedding-3-large-3072-100K\", split=\"train\"\n",
    ")\n",
    "openai_vectors = np.array(ds[\"text-embedding-3-large-3072-embedding\"])\n",
    "del ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-06-06T17:01:10.900963Z",
     "start_time": "2024-06-06T17:01:09.344842Z"
    }
   },
   "outputs": [],
   "source": [
    "openai_bin = np.zeros_like(openai_vectors, dtype=np.int8)\n",
    "openai_bin[openai_vectors > 0] = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-06-06T17:01:10.906827Z",
     "start_time": "2024-06-06T17:01:10.901820Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": "3072"
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n_dim = openai_vectors.shape[1]\n",
    "n_dim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🎯 Accuracy Function\n",
    "\n",
    "We will use the accuracy function to compare the original vectors with the binary vectors for a given index, limit, and oversampling rate. The comparison is done using dot products and logical XOR, sorting the results, and measuring the intersection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-06-06T17:01:10.909730Z",
     "start_time": "2024-06-06T17:01:10.908166Z"
    },
    "id": "FqshI-GlIERd"
   },
   "outputs": [],
   "source": [
    "def accuracy(idx, limit: int, oversampling: int):\n",
    "    scores = np.dot(openai_vectors, openai_vectors[idx])\n",
    "    dot_results = np.argsort(scores)[-limit:][::-1]\n",
    "\n",
    "    bin_scores = n_dim - np.logical_xor(openai_bin, openai_bin[idx]).sum(axis=1)\n",
    "    bin_results = np.argsort(bin_scores)[-(limit * oversampling) :][::-1]\n",
    "\n",
    "    return len(set(dot_results).intersection(set(bin_results))) / limit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📊 Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-06-06T17:01:25.206592Z",
     "start_time": "2024-06-06T17:01:10.911971Z"
    },
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "qtzUlq_sFTRf",
    "outputId": "17fe04ea-4f73-4a57-990b-180f1c04b472"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/4 [00:00<?, ?it/s]\n",
      "  0%|          | 0/2 [00:00<?, ?it/s]\u001b[A\n",
      " 50%|█████     | 1/2 [00:02<00:02,  2.05s/it]\u001b[A"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'sampling_rate': 1, 'limit': 3, 'mean_acc': 0.9}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|██████████| 2/2 [00:04<00:00,  2.02s/it]\u001b[A\n",
      " 25%|██▌       | 1/4 [00:04<00:12,  4.05s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'sampling_rate': 1, 'limit': 10, 'mean_acc': 0.8300000000000001}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "  0%|          | 0/2 [00:00<?, ?it/s]\u001b[A\n",
      " 50%|█████     | 1/2 [00:01<00:01,  1.72s/it]\u001b[A"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'sampling_rate': 2, 'limit': 3, 'mean_acc': 1.0}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|██████████| 2/2 [00:03<00:00,  1.76s/it]\u001b[A\n",
      " 50%|█████     | 2/4 [00:07<00:07,  3.75s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'sampling_rate': 2, 'limit': 10, 'mean_acc': 0.9700000000000001}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "  0%|          | 0/2 [00:00<?, ?it/s]\u001b[A\n",
      " 50%|█████     | 1/2 [00:01<00:01,  1.72s/it]\u001b[A"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'sampling_rate': 3, 'limit': 3, 'mean_acc': 1.0}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|██████████| 2/2 [00:03<00:00,  1.69s/it]\u001b[A\n",
      " 75%|███████▌  | 3/4 [00:10<00:03,  3.58s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'sampling_rate': 3, 'limit': 10, 'mean_acc': 0.9800000000000001}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "  0%|          | 0/2 [00:00<?, ?it/s]\u001b[A\n",
      " 50%|█████     | 1/2 [00:01<00:01,  1.68s/it]\u001b[A"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'sampling_rate': 5, 'limit': 3, 'mean_acc': 1.0}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|██████████| 2/2 [00:03<00:00,  1.65s/it]\u001b[A\n",
      "100%|██████████| 4/4 [00:14<00:00,  3.57s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'sampling_rate': 5, 'limit': 10, 'mean_acc': 0.99}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "number_of_samples = 10\n",
    "limits = [3, 10]\n",
    "sampling_rate = [1, 2, 3, 5]\n",
    "results = []\n",
    "\n",
    "\n",
    "def mean_accuracy(number_of_samples, limit, sampling_rate):\n",
    "    return np.mean(\n",
    "        [accuracy(i, limit=limit, oversampling=sampling_rate) for i in range(number_of_samples)]\n",
    "    )\n",
    "\n",
    "\n",
    "for i in tqdm(sampling_rate):\n",
    "    for j in tqdm(limits):\n",
    "        result = {\n",
    "            \"sampling_rate\": i,\n",
    "            \"limit\": j,\n",
    "            \"mean_acc\": mean_accuracy(number_of_samples, j, i),\n",
    "        }\n",
    "        print(result)\n",
    "        results.append(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ㆓ Binary Conversion\n",
    "\n",
    "Here, we will use 0 as the threshold for the binary conversion. All values greater than 0 will be set to 1, and others will remain 0. This is a simple and effective way to convert continuous values into binary values for OpenAI embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-06-06T17:01:25.247495Z",
     "start_time": "2024-06-06T17:01:25.213508Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>sampling_rate</th>\n      <th>limit</th>\n      <th>mean_acc</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>3</td>\n      <td>0.90</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1</td>\n      <td>10</td>\n      <td>0.83</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2</td>\n      <td>3</td>\n      <td>1.00</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>2</td>\n      <td>10</td>\n      <td>0.97</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>3</td>\n      <td>3</td>\n      <td>1.00</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>3</td>\n      <td>10</td>\n      <td>0.98</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>5</td>\n      <td>3</td>\n      <td>1.00</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>5</td>\n      <td>10</td>\n      <td>0.99</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
      "text/plain": "   sampling_rate  limit  mean_acc\n0              1      3      0.90\n1              1     10      0.83\n2              2      3      1.00\n3              2     10      0.97\n4              3      3      1.00\n5              3     10      0.98\n6              5      3      1.00\n7              5     10      0.99"
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results = pd.DataFrame(results)\n",
    "results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "machine_shape": "hm",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
